US Airline Reliability Analysis Documentation¶

1. Introduction¶

1.1 Background of Project and Dataset¶

This document explains how various visualisation concepts were used to derive insights for airline operators based on a dataset of flight delays and cancellations. This dataset is available on kaggle and originates from the US DOT flight statistics web page, which is official and open source. This base data contains 293k data points and 21 features, and will subsequently be augmented with more recent data and additional information on airports. A brief description of the data fields in the base dataset is shown below:

  • year

  • month

  • carrier : Abbreviation of carrier

  • carrier_name : the actual carrier name

  • airport : Abbreviation of airbort

  • airport_name : the actual airport name

  • arr_flights: Number of flights arrived the airport.

  • arr_del15 : Number of flights delayed.

  • carrier_ct: Number of flights delayed due to air carrier

  • weather_ct: Number of flights delayed due to weather.

  • nas_ct: Number of flights delayed due to National Aviation System ( non-extreme weather conditions, airport operations, heavy traffic volume, and air traffic control ) check more in here%3A,volume%2C%20and%20air%20traffic%20control.)

  • security_ct: Number of flights delayed due to security

  • late_aircraft_ct: Number of flights delayed due to a previous flight.

  • arr_cancelled: Number of flight that has been cancelled.

  • arr_diverted: Number of flight that has been diverted.

  • arr_delay: time of delayed flights.

  • carrier_delay: time of delayed flights due to air carrier.

  • weather_delay: time of delayed flights due to weather.

  • nas_delay:time of delayed flights due to National Aviation System.

  • security_delay:time of delayed flights due to security.

  • late_aircraft_delay:time of delayed flights due to a previous flight.

1.2 Objectives of Analysis¶

For the target audience of airline operators, we identify three hypothesis and explain how they are useful to airline operators:

  1. Hypothesis 1 concerns airline reliability measured in terms of punctuality and delay times. We analyse how this reliability changed over time and propose business insights for how airlines can further improve their reliability.
  2. Hypothesis 2 is an outlier analysis and examines how airlines were affected by the COVID pandemic. We visualise which airlines and areas were more affected by flight cancellations during this time to identify regions requiring more targeted measures to improve reliability.
  3. Hypothesis 3 goes beyond factors that are out of control for airlines (weather, season) and explores how airlines can further improve reliability through system maintenance and closer synergy with airports.
1.3 Setting Up the Notebook¶
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

2. Hypothesis 1¶

The first hypothesis is that the overall reliability of US airlines has improved over time and is expected to continue improving in the future. In this analysis, we will use the flight punctuality rate and average delay time to represent reliability. To investigate this hypothesis, we first group the data by year, calculate the overall punctuality rate and average delay time of each year, and plot the data on a graph. On the same graph, we perform linear regressions on both the punctuality rate data and average delay time data to observe the trends. In the second graph, we break down the delay rate and average delay time into different delay factors to analyze how each factor has changed individually. In the third graph, we create a correlation matrix of delayed factors to investigate the correlation between them.

2.1 Data Loading¶
In [5]:
# Load the data
data = pd.read_csv('airline delay causes.csv')
data.head()
Out[5]:
year month carrier carrier_name airport airport_name arr_flights arr_del15 carrier_ct weather_ct ... security_ct late_aircraft_ct arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay security_delay late_aircraft_delay
0 2009 11 AS Alaska Airlines Inc. BRW Barrow, AK: Wiley Post/Will Rogers Memorial 65.0 14.0 3.61 1.06 ... 0.0 6.19 0.0 1.0 920.0 592.0 34.0 78.0 0.0 216.0
1 2009 11 AS Alaska Airlines Inc. BUR Burbank, CA: Bob Hope 85.0 5.0 1.00 0.00 ... 0.0 0.98 0.0 0.0 134.0 25.0 0.0 70.0 0.0 39.0
2 2009 11 AS Alaska Airlines Inc. CDV Cordova, AK: Merle K Mudhole Smith 58.0 8.0 2.04 1.00 ... 0.0 2.96 4.0 2.0 586.0 174.0 37.0 62.0 0.0 313.0
3 2009 11 AS Alaska Airlines Inc. DCA Washington, DC: Ronald Reagan Washington National 88.0 11.0 1.99 0.00 ... 0.0 0.00 0.0 0.0 266.0 65.0 0.0 201.0 0.0 0.0
4 2009 11 AS Alaska Airlines Inc. DEN Denver, CO: Denver International 123.0 5.0 3.47 0.00 ... 0.0 0.00 0.0 0.0 81.0 57.0 0.0 24.0 0.0 0.0

5 rows × 21 columns

2.2 Overall Trend of Punctuality and Average Delay Time¶
In [7]:
# Filter the data to only include years 2003-2019 (exclude 2020)
data = data[data['year'].between(2003, 2019)]

# Group the data by year and calculate the overall punctuality rate and average delay time
grouped_data = data.groupby('year')[['arr_flights', 'arr_del15', 'arr_cancelled', 'arr_delay']].sum().reset_index()
grouped_data['punctuality_rate'] = (grouped_data['arr_flights'] - grouped_data['arr_del15'] - grouped_data['arr_cancelled']) / grouped_data['arr_flights'] * 100
grouped_data['avg_delay_time'] = grouped_data['arr_delay'] / grouped_data['arr_del15']
In [8]:
# Perform linear regression on the punctuality rate data
x1 = grouped_data['year']
y1 = grouped_data['punctuality_rate']
coeffs1 = np.polyfit(x1, y1, 1)
m1 = coeffs1[0]
b1 = coeffs1[1]

# Perform linear regression on the average delay time data
x2 = grouped_data['year']
y2 = grouped_data['avg_delay_time']
coeffs2 = np.polyfit(x2, y2, 1)
m2 = coeffs2[0]
b2 = coeffs2[1]

# Create the line chart
fig, ax1= plt.subplots()

color1 = 'tab:red'
ax1.set_xlabel('Year')
ax1.set_ylabel('Punctuality Rate (%)', color=color1)
ax1.plot(x1, y1, color=color1, label='Punctuality Rate')
ax1.tick_params(axis='y', labelcolor=color1)
ax1.plot(x1, m1*x1+b1, color=color1, linestyle='--', label='Punctuality Rate Regression Line')
plt.legend(loc='upper left')

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

color2 = 'tab:blue'
ax2.set_ylabel('Average Delay Time (minutes)', color=color2)
ax2.plot(x2, y2, color=color2, label='Average Delay Time in dealyed flight')
ax2.tick_params(axis='y', labelcolor=color2)
ax2.plot(x2, m2*x2+b2, color=color2, linestyle='--', label='Average Delay Time Regression Line')
plt.legend(loc='upper right')

# Customize the layout of the chart
fig.tight_layout() 
plt.title('Overall Punctuality Rate and Average Delay Time Over Time')
plt.show();
2.3 Breakdown into Delay Factors¶
In [10]:
# Create a figure with two subplots
fig, ax = plt.subplots(1, 2, figsize=(16, 6))

# Calculate the percentage of delayed flights due to different reasons for each year
delayed_by_carrier = data.groupby('year')['carrier_ct'].sum() / data.groupby('year')['arr_flights'].sum() * 100
delayed_by_weather = data.groupby('year')['weather_ct'].sum() / data.groupby('year')['arr_flights'].sum() * 100
delayed_by_nas = data.groupby('year')['nas_ct'].sum() / data.groupby('year')['arr_flights'].sum() * 100
delayed_by_security = data.groupby('year')['security_ct'].sum() / data.groupby('year')['arr_flights'].sum() * 100
delayed_by_late_aircraft = data.groupby('year')['late_aircraft_ct'].sum() / data.groupby('year')['arr_flights'].sum() * 100

# Create a single area chart for all delay factors in the first subplot
ax[0].stackplot(delayed_by_carrier.index, delayed_by_carrier, delayed_by_weather, delayed_by_nas, delayed_by_security, delayed_by_late_aircraft, labels=['Air Carrier', 'Weather', 'NAS', 'Security', 'Late Aircraft'])

# Customize the layout of the first chart
ax[0].set_title('Percentage of Delayed Flights Due to Different Reasons Over Time')
ax[0].set_xlabel('Year')
ax[0].set_ylabel('Percentage of Delayed Flights')
ax[0].legend()

# Calculate the average delayed time due to different reasons for each year
delayed_by_carrier = data.groupby('year')['carrier_delay'].sum() / data.groupby('year')['carrier_ct'].sum() 
delayed_by_weather = data.groupby('year')['weather_delay'].sum() / data.groupby('year')['weather_ct'].sum() 
delayed_by_nas = data.groupby('year')['nas_delay'].sum() / data.groupby('year')['nas_ct'].sum()
delayed_by_security = data.groupby('year')['security_delay'].sum() / data.groupby('year')['security_ct'].sum() 
delayed_by_late_aircraft = data.groupby('year')['late_aircraft_delay'].sum() / data.groupby('year')['late_aircraft_ct'].sum()

# Create a line chart with a separate trace for each delay factor in the second subplot
ax[1].plot(delayed_by_carrier.index, delayed_by_carrier, label='Air Carrier')
ax[1].plot(delayed_by_weather.index, delayed_by_weather, label='Weather')
ax[1].plot(delayed_by_nas.index, delayed_by_nas, label='NAS')
ax[1].plot(delayed_by_security.index, delayed_by_security, label='Security')
ax[1].plot(delayed_by_late_aircraft.index, delayed_by_late_aircraft, label='Late Aircraft')

# Customize the layout of the second chart
ax[1].set_title('Average delay time due to different reasons over time')
ax[1].set_xlabel('Year')
ax[1].set_ylabel('Average delay time (minutes)')
ax[1].legend()

# Show the figure
plt.show()
2.4 Correlation Between Delay Factors¶
In [11]:
delay_factors = ['carrier_delay', 'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay']
delay_data = data[delay_factors]

correlations = delay_data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlations, cmap='coolwarm', annot=True, fmt='.2f')
plt.title('correlation matrix between flight delay factors')
plt.show()

3. Hypothesis 2¶

The second hypothesis is major airlines were less affected by flight cancellations during the COVID-19 pandemic, which drastically altered the air travel landscape. As air travel gradually recovers to pre-pandemic levels, airline operator's might find it useful to understand how their operations were affected by the pandemic so they can implement effective recovery measures. After all, domestic travel accounts for almost 70% of the revenue for some of these companies. In the following section, we therefore visualise how domestic flights were affected during this period.

3.1 Data Loading¶
In [1]:
"""
In this section, we import the input data from a CSV file and perform some simple preprocessing.
"""
#Setting up Notebook
import numpy as np
import pandas as pd

#Reading Input Files
flightData = pd.read_csv('Airline_Delay_Cause_New.csv')
cancelData = flightData[["year", "month", "carrier_name", "airport", "arr_flights", "arr_cancelled"]] #retain relevant columns
cancelData.dropna()
cancelData.head()
Out[1]:
year month carrier_name airport arr_flights arr_cancelled
0 2022 11 Endeavor Air Inc. ABY 86.0 0.0
1 2022 11 Endeavor Air Inc. AEX 59.0 0.0
2 2022 11 Endeavor Air Inc. AGS 21.0 0.0
3 2022 11 Endeavor Air Inc. ALB 98.0 3.0
4 2022 11 Endeavor Air Inc. ATL 1143.0 3.0
3.2 Spike in Flight Cancellation Rates in 2020¶
In [2]:
"""
In this section, we compute and aggregage the flight cancellation statistics for each airline carrier by month. This code takes
1-2 minutes to run.
"""
from datetime import datetime

#Creating Dictionary to Store Cancellation Rate
cancelAirlineDict = pd.Series(index=cancelData.carrier_name).to_dict()
for key in cancelAirlineDict.keys():
    cancelAirlineDict[key] = []

#Cleaning and Processing Input Data
prevCarrier = cancelData["carrier_name"][0]
prevMonth = cancelData["month"][0]
numFlights = 0
numCancel = 0
arrLen = 0
date = []

for i in cancelData.index:
    
    #Checking that All Airlines have Entries for Each Month
    if cancelData.iloc[i]["month"] != prevMonth:
        arrLen += 1
        for key,value in cancelAirlineDict.items():
            if len(value) < arrLen:
                cancelAirlineDict[key].append(0)
            if len(value) > arrLen:
                cancelAirlineDict[key] = cancelAirlineDict[key][:-1]
        timestamp = str(cancelData.iloc[i]["year"]) + "/" + str(cancelData.iloc[i]["month"]) + "/01" 
        date.append(datetime.strptime(timestamp, '%Y/%m/%d'))
    
    #Cumulatively adding Cancellation Rate Parameters
    if cancelData.iloc[i]["carrier_name"] == prevCarrier:
        numFlights += cancelData.iloc[i]["arr_flights"]
        numCancel += cancelData.iloc[i]["arr_cancelled"]
    
    #Calculating Cancellation Rate and Reset Boiler Plate
    else:
        if np.isnan(numFlights) or np.isnan(numCancel):
            cancelAirlineDict[prevCarrier].append(0)
        else:
            cancelAirlineDict[prevCarrier].append(round(numCancel/numFlights, 3))
        
        numFlights = cancelData.iloc[i]["arr_flights"]
        numCancel = cancelData.iloc[i]["arr_cancelled"]
        prevCarrier = cancelData.iloc[i]["carrier_name"]
        prevMonth = cancelData.iloc[i]["month"]

#Preprocessing for Final Round
timestamp = str(cancelData.iloc[-1]["year"]) + "/" + str(cancelData.iloc[-1]["month"]) + "/01"
date.append(datetime.strptime(timestamp, '%Y/%m/%d'))
    
cancelAirlineDict[prevCarrier].append(numCancel/numFlights)
arrLen += 1
for key,value in cancelAirlineDict.items():
    if len(value) < arrLen:
        cancelAirlineDict[key].append(0)
    if len(value) > arrLen:
        cancelAirlineDict[key] = cancelAirlineDict[key][:-1]

#Calculating Average Cancellation Rates
cancelAirlineAv = []
for i in range(0,arrLen):
    cumSum = 0
    for key,value in cancelAirlineDict.items():
        cumSum += value[i]
    cancelAirlineAv.append(cumSum/len(cancelAirlineDict.keys()))
C:\Users\Tan Ning Xuan\AppData\Local\Temp\ipykernel_13292\1947007629.py:8: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
  cancelAirlineDict = pd.Series(index=cancelData.carrier_name).to_dict()

We are now ready to examine how COVID has affected flight cancellations We plot the cancellation rate for airlines over the years (rate is preferred over the nominal cancellation numbers to account for the different numbers of flight across carriers). We observe a sigificant spike in 2020, corresponding to the period where the impact of COVID was most disruptive. The plot also shows the flight cancellations were particularly bad for Alliegiant, Southwest, and American Airlines. In contrast, Frontier, Alaska, and United Airlines were found to be more reliabile. The cancellation rate was about 19% at its highest.

In [3]:
"""
In this section, we visualise the flight cancellation rates.
"""
from pandas import Timestamp
import plotly.express as px
from plotly.graph_objs import *

#Aggregating Results
cancelAirline = pd.DataFrame.from_dict(cancelAirlineDict)
cancelAirline['date'] = date
cancelAirline['Average'] = cancelAirlineAv

#Plotting Graphs
fig = px.line(cancelAirline, x="date", y=['Endeavor Air Inc.', 'American Airlines Inc.', 'Alaska Airlines Inc.', 
                                          'JetBlue Airways', 'Delta Air Lines Inc.', 'Frontier Airlines Inc.', 'Allegiant Air',
                                          'Hawaiian Airlines Inc.', 'Envoy Air', 'Spirit Air Lines', 'PSA Airlines Inc.', 
                                          'SkyWest Airlines Inc.', 'Horizon Air', 'United Air Lines Inc.', 
                                          'Southwest Airlines Co.', 'Mesa Airlines Inc.', 'Republic Airline',
                                          'ExpressJet Airlines LLC', 'ExpressJet Airlines Inc.', 'Virgin America', 'Average'], 
             labels = {"date": "Date", "value": "Cancellation Rate", "variable": "Airline"}, 
              title= "Airline Cancellation Rates")
fig.update_layout(height=615, paper_bgcolor="rgba(0,0,0,0)", plot_bgcolor="rgba(192,192,192,0.2)")
fig.for_each_trace(
    lambda trace: trace.update(line_color="lightgrey") if trace.name != "Average" else (),
)
fig.show()
3.3 Cancellation Trend Among Best and Wost Performers¶

From our analysis above, we identified three airlines that were particularly unreliable and another three that were particularly reliable. We conduct a further analysis on the actual number of delays and examine how each airline was affected by COVID.

In [4]:
"""
In this section, we extract the relevant columns from our input file to run the analysis on airline delay times.
"""
delayData = flightData[["year", "carrier_name", "arr_cancelled"]] #retain relevant columns
delayData.dropna()
delayData.head()
Out[4]:
year carrier_name arr_cancelled
0 2022 Endeavor Air Inc. 0.0
1 2022 Endeavor Air Inc. 0.0
2 2022 Endeavor Air Inc. 0.0
3 2022 Endeavor Air Inc. 3.0
4 2022 Endeavor Air Inc. 3.0
In [5]:
"""
In this section, we aggregate the number of cancellations of the six selected airlines: Allegiant, Southwest, American, 
Frontier, Alaska, and United Airlines. This code takes about 1 minute to run.
"""
#Creating Arrays to Store Cancellation Parameters
year = []
airline = []
cancel =[]

#Cleaning and Processing Input Data
for i in delayData.index:   
    if delayData.iloc[i]["carrier_name"] in ["Allegiant Air", "Southwest Airlines Co.", "American Airlines Inc.", 
                                            "Frontier Airlines Inc.", "Alaska Airlines Inc.", "United Air Lines Inc."]:
        year.append(delayData.iloc[i]["year"])
        airline.append(delayData.iloc[i]["carrier_name"])
        cancel.append(delayData.iloc[i]["arr_cancelled"])

#Aggregating Results
delayAirline = pd.DataFrame(data={"Year": year, "Airline": airline, "Cancellations": cancel})

We observe that the year 2020 contained noticeable increases in both the average number of cancellations and variability in the number of cancellations. Note that we cannot simply compare the number of cancellations across airlines as these are nominal figures and each airlines flies a different number of routes. Compared to their usual reliability levels, Allegiant Airlines had noticeably more flight cancellations. Major airlines such as Southwest and American airlines have been better able to cope with the pandemic as the number of flight cancellations did not increase much in that period. The variablity in number of cancellations, however, did increase, suggesting that some airports were more affected by flight cancellations on some months. In contrast, the changes in cancellations for Allegiant Air are more pronounced in 2020. Therefore, by both measures of reliability (cancellation rate, reliability compared to past years), Alliegiant Airline performed more poorly.

In [6]:
"""
In this section, we visualise the change in flight cancellation over the years for the selected airlines.
"""
from plotnine import ggplot, aes, facet_wrap, scales, geom_boxplot, labs

(
    ggplot(delayAirline, aes('Year', 'Cancellations', color='Airline'))
    + geom_boxplot(aes(x='factor(Year)', y='Cancellations'))
    + scales.scale_y_log10()
    + facet_wrap('~Airline')
    + labs(title='Cancellations per airport per month')
)
C:\Users\Tan Ning Xuan\anaconda3\lib\site-packages\pandas\core\arraylike.py:397: RuntimeWarning: divide by zero encountered in log10
C:\Users\Tan Ning Xuan\anaconda3\lib\site-packages\plotnine\layer.py:333: PlotnineWarning: stat_boxplot : Removed 15156 rows containing non-finite values.
Out[6]:
<ggplot: (132280935927)>
3.4 Allegiant Air Recovery Focus Areas¶

We can go one step further to help one of these airlines with poorer reliability improve. From the data, we can identify which destinations were more impacted by cancellations so the airlines know where to best channel their resources to minimise flight cancellations. In the following section, we pick Allegiant Air as the example for our analysis.

In [7]:
"""
In this section, we extract the relevant columns from our input file to run the reliability analysis at the route level. 
"""
#Reading Input File
airportData = flightData[["year", "carrier_name", "airport", "arr_cancelled"]] #retain relevant columns
airportData.dropna()
airportData.head()

#Reading Additional Input File
#coordinatesData = pd.read_csv('Coordinates.csv')
Out[7]:
year carrier_name airport arr_flights arr_cancelled
0 2022 Endeavor Air Inc. ABY 86.0 0.0
1 2022 Endeavor Air Inc. AEX 59.0 0.0
2 2022 Endeavor Air Inc. AGS 21.0 0.0
3 2022 Endeavor Air Inc. ALB 98.0 3.0
4 2022 Endeavor Air Inc. ATL 1143.0 3.0
In [10]:
"""
In this section, we aggregate the cancellation numbers for each destination of Allegiant Air for the years of 2018-2019 and
2020 to examine the difference before and after the pandemic. This code takes about 1 minute to run. 
"""
#Creating Dictionaries to Store Cancellation Rate Parameters
cancel1819 = pd.Series(index=airportData.airport).to_dict()
cancel20 = pd.Series(index=airportData.airport).to_dict()
difference = pd.Series(index=airportData.airport).to_dict()
for key in cancel1819.keys():
    cancel1819[key] = 0
    cancel20[key] = 0

#Cleaning and Processing Input Data
for i in airportData.index:
    if airportData.iloc[i]["carrier_name"] == "Allegiant Air":
        if airportData.iloc[i]["year"]==2018 or airportData.iloc[i]["year"]==2019:
            cancel1819[airportData.iloc[i]["airport"]] += airportData.iloc[i]["arr_cancelled"]
        elif airportData.iloc[i]["year"]==2020:
            cancel20[airportData.iloc[i]["airport"]] += airportData.iloc[i]["arr_cancelled"]

#Aggregating Results
for key in difference.keys():
    if cancel20[key]>0 and cancel1819[key]>0:
        difference[key] = cancel20[key] - cancel1819[key]
airportDiff = pd.DataFrame(difference.items(), columns=['FAA Code', 'Difference'])
airportDiff.dropna(inplace=True)
C:\Users\Tan Ning Xuan\AppData\Local\Temp\ipykernel_13292\356042384.py:6: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
C:\Users\Tan Ning Xuan\AppData\Local\Temp\ipykernel_13292\356042384.py:7: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
C:\Users\Tan Ning Xuan\AppData\Local\Temp\ipykernel_13292\356042384.py:8: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
In [11]:
"""
In this section, we import two new files: one containing the list of airport codes, and another containing their GPS 
coordinates. We combine these data with our cancellation rate data from the preceding section. We then use an API developed
by the US FCC to obtain the FIPS codes for each airport. This data will subsequently be used to plot the choropleth. This code
takes about 2-3 minutes to run.
"""
import requests

#Defining Helper Function to Obtain FIPS Code
def latlong2fips (latitude, longitude):
    url = "https://geo.fcc.gov/api/census/block/find?format=json&latitude=%f&longitude=%f" % (latitude, longitude)
    r = requests.get(url)
    return r.json()['County']['FIPS']

#Importing Additional Files
codeData = pd.read_csv('Airport_Codes.csv')
codeData.dropna()
GPSData = pd.read_csv('US_Airport_Database.csv')
GPSData.dropna()

#Combining Data
Fips = []
airportNames = []

for faa in airportDiff["FAA Code"]:
    idx = codeData.index[codeData['FAA'] == faa][0] #find matching FAA code
    airportNames.append(codeData.iloc[idx]["Airport"])
    icao = codeData.iloc[idx]["ICAO"]
    idx2 = GPSData.index[GPSData['ICAO'] == icao][0] #find matching ICAO code
    lat,long = GPSData.iloc[idx2]["LATITUDE"], GPSData.iloc[idx2]["LONGITUDE"]
    fip = latlong2fips(lat, long) #find matching coordinates
    Fips.append(fip)

airportDiff["Fips"] = Fips
airportDiff["Name"] = airportNames

We now visualise this difference in cancellation rates on a choropleth map. From the data, we can readily identify areas which Allegiant Air has suffered more cancellations during the COVID period than usual: Harry Reid (NV) and Orlando Sanford (FL). They should therefore focus more of their supporting resources in these areas to recover more quickly from the pandemic.

In [13]:
"""
In this section, we visualise the difference in flight cancellations before and after COVID. This code takes about 1 minute
to run. 
"""
from urllib.request import urlopen
import json

#Plotting Map
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
fig = px.choropleth_mapbox(airportDiff, geojson=counties, locations='Fips', color='Difference',
                           color_continuous_scale="Bluered", mapbox_style="carto-positron", zoom=3,
                           center={"lat":37.0902,"lon":-95.7129}, opacity=0.5, hover_name='Name')
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

4. Hypothesis 3¶

The third hypothesis is "Airlines in the airports of their headquarters should have a lower delay rate". To argue this point, we firstly select 9 ongoing operated airlines and corresponding 9 headquarter cities (may involve duplicates), computed average delay rate for these airlines in the cities, and produced ordered bar charts for comparison. After generating charts, we analyze any potential trend and dive into what and why the outlier airlines occur.

4.1 Data Loading¶
  • Read dataset: Airline_Delay_Cause.csv as variable airline_delay
  • Compute delay rate from existing columns: the number of delayed flights (arr_del15) over the number of arrival flights (arr_flights)
  • Drop any empty value from the dataset
In [12]:
# read the first dataset
airline_delay = pd.read_csv("airline delay causes.csv")

# compute delay rate from existing columns
airline_delay["delay_prop"] = 100 * airline_delay["arr_del15"] / airline_delay["arr_flights"]
airline_delay.dropna(inplace = True)
airline_delay.head(n = 3)
Out[12]:
year month carrier carrier_name airport airport_name arr_flights arr_del15 carrier_ct weather_ct ... late_aircraft_ct arr_cancelled arr_diverted arr_delay carrier_delay weather_delay nas_delay security_delay late_aircraft_delay delay_prop
0 2009 11 AS Alaska Airlines Inc. BRW Barrow, AK: Wiley Post/Will Rogers Memorial 65.0 14.0 3.61 1.06 ... 6.19 0.0 1.0 920.0 592.0 34.0 78.0 0.0 216.0 21.538462
1 2009 11 AS Alaska Airlines Inc. BUR Burbank, CA: Bob Hope 85.0 5.0 1.00 0.00 ... 0.98 0.0 0.0 134.0 25.0 0.0 70.0 0.0 39.0 5.882353
2 2009 11 AS Alaska Airlines Inc. CDV Cordova, AK: Merle K Mudhole Smith 58.0 8.0 2.04 1.00 ... 2.96 4.0 2.0 586.0 174.0 37.0 62.0 0.0 313.0 13.793103

3 rows × 22 columns

In [13]:
"""
We filter the data with only airports and airlines we selected
"""
# record the headquarter airport of each selected airline
headquarter = {
    "UA" : "ORD", 
    "DL" : "ATL",
    "F9" : "DEN",
    "NK" : "MCO",
    "MQ" : "DFW",
    "YV" : "PHX",
    "AA" : "DFW",
    "AS" : "SEA",
    "VX" : "SFO"
}

# filter the data with only these airports and airlines (carriers)
airports = list(headquarter.values())
carriers = list(headquarter.keys())
df = airline_delay.query("carrier in @carriers & airport in @airports").reset_index(drop = True)
In [14]:
"""
Compute the average delay rate of each airline in each airport. If an airline doesn't operate in an airport, create a new row 
with delay_prop as 0. Round each rate with 2 decimal places and change zero rate as "no flight"
"""
# Compute the average delay rate of each airline in each airport
temp = df.groupby(["carrier","airport"])["delay_prop"].mean().reset_index()
for c in carriers:
    if temp["carrier"].value_counts()[c] != len(carriers):
        temp_df = temp.query("carrier == @c")

        ## append no flight rows
        for a in airports:
            if a not in list(temp_df["airport"]):
                temp = temp.append({
                    "carrier" : c,
                    "airport" : a,
                    "delay_prop" : 0
                }, ignore_index = True)

# round to 2 decimal places and change zero rate to "no flight" 
temp["labels"] = temp["delay_prop"].apply(lambda x : str(round(x, 2)) + "%" if x != 0 else "no flight")
4.2 Delay Rates at Headquarter Airports¶
In [15]:
"""
set the color code for each airline because our target audience is the airline officers, different color codes can highlight
the headquarter more obvious and easily found draw bar chart as subplot for each carrier and change the title and ticks properly
"""
# color code for each airline (searched on Google)
color_map = {
    "DL" : "#E3132C",
    "UA" : "#005daa",
    "F9" : "#248168",
    "NK" : "#FFD700",
    "AS" : "#00385F",
    "YV" : "#2B1B17",
    "AA" : "#086591",
    "VX" : "#F70D1A",
    "MQ" : "#1F45FC"
}

fig, ax = plt.subplots(3, 3, figsize = (22,16))
for i in range(0, 3):
    for j in range(0, 3):

        ## extract the current carrier and corresponding data, but will not involve no flight data for the beauty of chart
        c = carriers[3 * i + j]
        temp_c = temp.query("carrier == @c").sort_values(by = "delay_prop")
        temp_c = temp_c.query("labels != 'no flight'")

        ## draw a bar for each airport: if the airport is in the headquarter, use the specific color code, otherwise light grey
        colors = []
        y_axis = np.arange(temp_c.shape[0])
        for airport in list(temp_c["airport"]):
            if airport == headquarter[c]:
                colors.append(color_map[c])
            else:
                colors.append("#E5E4E2")

        ## setup the bar label, tick label, and bar chart title
        plot = ax[i,j].barh(width = tuple(temp_c["delay_prop"]), y = y_axis, color = colors)
        ax[i,j].bar_label(plot, labels = list(temp_c["labels"]), padding = 3)
        ax[i,j].set_yticks(y_axis, tuple(temp_c["airport"]))
        ax[i,j].set_xlim(0, 35)
        ax[i,j].set_title("Carrier " + c, fontsize = 13, fontweight = "bold")
        ax[i,j].set_xlabel("Avg Delay Rate (%)")

# setup the overall chart title
fig.suptitle("Average Delay Rate of Each Airline (Headquarter Highlighted)", fontsize = 25, fontweight = "bold")
plt.show()

From the chart above, we can see that 7 out 9 airlines have the least or relatively lower delay rate in their respective headquarters. This can be a convincing evidence to show that airlines in their headquarters tend to have a lower delay rate. But there are 2 outliers: Alaska Airline (AS) and Virgin America Airline (VX), which have a high delay rate in Seattle and San Francisco. We will focus on analyzing why it could happen and how to improve.

4.3 Analysis of Alaska Airline (AS) in the airport SEA¶

To analyze Alaska Airline in its headquarter Seattle, we compare the average proportion of each delay factor of AS in Seattle and across the US.

In [16]:
"""
Compute the average proportion for each factor, Alaska Airline only, in Seattle and the US overall
"""
tmp_y = ['carrier_ct', 'weather_ct', 'nas_ct', 'security_ct', 'late_aircraft_ct']

# across the US
as_factor = airline_delay.query("carrier == 'AS'")[tmp_y].mean()
as_factor_prop = 100 * as_factor / as_factor.sum()

# only Seattle
sea_factor = airline_delay.query("carrier == 'AS' and airport == 'SEA'")[tmp_y].mean()
sea_factor_prop = 100 * sea_factor / sea_factor.sum()
In [17]:
"""
Generate a bar chart for Seattle data and a dash line chart for the US overall
"""
# bar chart for Seattle data
plt.bar(sea_factor_prop.index, sea_factor_prop.values, alpha = 0.5, color = "#CB6D51")

# dash line chart for the US overall
plt.plot(as_factor_prop.index, as_factor_prop.values, "o--", color = "green")

# set up appropriate chart title, ticks, and labels
plt.title("Proportion of Delay Factors of Airline AS \n Across the US and in Airport SEA")
plt.xticks(np.arange(0,5), ["Carrier", "Weather", "NAS", "Security", "Late Aircraft"])
plt.ylabel("Proporation (%)")
plt.legend(["Across the US", "Only in SEA"], loc = "upper left")
plt.show()

From the chart above, we can observe that for Alaska Airline, the delay cause of NAS and late aircraft in Seattle is much high than across the US. The late aircraft is analyze in the first hypothesis, and the potential suggestion can be applied here that airlines should design the flight timetable not that dense to avoid this kind of delay. For the NAS part, we search some online news about the delays caused by the system and the airline's operation, and then provided insights based on these information.

4.4 Analysis of Virgin America (VX) in the airport SFO¶

The Virgin America (VX) airline is headquartered at San Francisco, but its average delay rate at San Francisco is not relatively lower than at other selected airports. To analyze this pheonomenon, we try to explore the problem from both the perspectives of SFO airport and VX airline.

In [18]:
"""
Average delay rate of each airline at SFO airport.
"""
# subset of dataset at airport SFO
sfo = airline_delay[airline_delay['airport'] == 'SFO']

# calculate the average delay rate by the group of each carrier, sort the new dataframe by value and reset the index
s = sfo.groupby('carrier')['delay_prop'].mean()
sorted_s = s.sort_values(ascending=False)
sorted_s = sorted_s.reset_index('carrier')


# set the length range of x axis and rename it by the carriers
x_range = range(0, len(s))
plt.xticks(x_range, sorted_s['carrier'])

# plot all the lollipops for each airline
plt.vlines(sorted_s['carrier'], ymin=0, ymax=sorted_s['delay_prop'], color='skyblue')
plt.scatter(sorted_s['carrier'], sorted_s['delay_prop'], color='skyblue', s=30)

# highligh the lollipop of airline VX in the chart
plt.vlines('VX', ymin=0, ymax=s.loc['VX'], color='orange')
plt.scatter('VX', s.loc['VX'], color='orange', s=30)

# rename the plot and display it
plt.title("Average delay rate of all airlines at SFO airport")
plt.show()

According to the lollipop chart above, it can be seen that the majority of airlines own a delay rate over 20% at San Francisco airport (SFO). Therefore, delays of flights at SFO tend to be a common problems across all airlines. Because of the FAA safety regulations and the limitations of runways in SFO, the airlines that arrive at SFO must land single-file with greater time intervals (SFO website, 2023).

To further validate this possible explanation, we try to compare the detailed delay factors of VX at SFO and at all airports in US.

In [19]:
"""
Delay factors of VX at SFO and across US.
"""
# only SFO
sfo_factor = airline_delay.query("carrier == 'VX' and airport == 'SFO'")[tmp_y].mean()
sfo_factor_prop = 100 * sfo_factor / sfo_factor.sum()

# bar chart for San Francisco data
plt.bar(sfo_factor_prop.index, sfo_factor_prop.values, alpha = 0.5, color = "#CB6D51")

# dash line chart for the US overall
plt.plot(as_factor_prop.index, as_factor_prop.values, "o--", color = "green")

# set up appropriate chart title, ticks, and labels
plt.title("Proportion of Delay Factors of Airline VX \n Across the US and in Airport SFO")
plt.xticks(np.arange(0,5), ["Carrier", "Weather", "NAS", "Security", "Late Aircraft"])
plt.ylabel("Proporation (%)")
plt.legend(["Across the US", "Only in SFO"], loc = "upper left")
plt.show()

According to the integrated graph of barchart and dash line, delays resulting from factor 'NAS' at SFO for VX is higher than its performance across US by about 15%, which is intuitively in accordance with the official explanation above.

In addition, VX airline deals with other factors better at SFO airport except for the landing limitation. This can still be a kind of corroboration of the hypothesis of this section.

5. References¶

  • [1] Kaggle. (2022) USA Airline Delay Cause. https://www.kaggle.com/datasets/mahmoudhassanmahmoud/airline-delay-20032022. [Accessed at 11th February 2023]
  • [2] Fox 13 Seattle. (2022) Outage causes 50 Alaska Airlines flights to be delayed at Sea-Tac Airport. https://www.q13fox.com/news/outage-causes-50-alaska-airlines-flights-to-be-delayed-at-sea-tac. [Accessed at 21st February 2023]
  • [3] Business Chief. (2020) Alaska Airlines Flights Delayed Due to Computer Glitch. https://businesschief.com/leadership-and-strategy/alaska-airlines-flights-delayed-due-computer-glitch. [Accessed at 21st February 2023]
  • [4] San Francisco International Airport. (2023) Airport Operations, Policies & Regulations, Weather Impact. https://www.flysfo.com/about/airport-operations/policies-regulations/weather-impact#:~:text=Runway%20constraints&text=SFO's%20runways%20are%20only%20750,to%20meter%20flights%20into%20SFO. [Accessed at 21st February 2023]
In [ ]: